Multidimensional Range Queries on Modern Hardware
Range queries over multidimensional data are an important part of database
workloads in many applications. Their execution may be accelerated by using
multidimensional index structures (MDIS), such as kd-trees or R-trees. As for
most index structures, the usefulness of this approach depends on the
selectivity of the queries, and common wisdom holds that a simple scan beats
MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom
is largely based on evaluations that are almost two decades old, performed on
data being held on disks, applying IO-optimized data structures, and using
single-core systems. The question is whether this rule of thumb still holds
when multidimensional range queries (MDRQ) are performed on modern
architectures with large main memories holding all data, multi-core CPUs and
data-parallel instruction sets. In this paper, we study whether, and to what
extent, modern hardware influences the performance ratio between index
structures and scans for MDRQ. To this end, we conservatively adapted three
popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit
features of modern servers and compared their performance to different flavors
of parallel scans using multiple (synthetic and real-world) analytical
workloads over multiple (synthetic and real-world) datasets of varying size,
dimensionality, and skew. We find that all approaches benefit considerably from
using main memory and parallelization, yet to varying degrees. Our evaluation
indicates that, on current machines, scanning should be favored over parallel
versions of classical MDIS even for very selective queries.
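The scan baseline the abstract compares against can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function name `range_query_scan` and the point/box layout are assumptions for this example, and the paper's scans are parallelized while this version is sequential for brevity.

```python
# Minimal sketch (not the paper's code) of the scan baseline: answer an
# axis-aligned multidimensional range query by checking every point.

def range_query_scan(points, box):
    """Return all points inside the box.

    points: iterable of d-dimensional tuples
    box: list of d (lo, hi) pairs, inclusive per-dimension bounds
    """
    return [p for p in points
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, box))]

pts = [(0.1, 0.2), (0.6, 0.1), (0.3, 0.4), (0.9, 0.9)]
# Query box: lower-left quarter of the unit square
print(range_query_scan(pts, [(0.0, 0.5), (0.0, 0.5)]))
# -> [(0.1, 0.2), (0.3, 0.4)]
```

A scan like this touches every point regardless of selectivity, which is exactly why its competitiveness against index structures hinges on how cheap the per-point check is on modern hardware.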
Adaptive efficient compression of genomes
Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. However, memory requirements of current algorithms are high and run times are often slow. In this paper, we propose an adaptive, parallel and highly efficient referential sequence compression method which allows fine-tuning of the trade-off between required memory and compression speed. When using 12 MB of memory, our method is on par with the best previous algorithms for human genomes in terms of compression ratio (400:1) and compression speed. In contrast, it compresses a complete human genome in just 11 seconds when provided with 9 GB of main memory, which is almost three times faster than the best competitor while using less main memory.
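The referential idea, storing only differences against a known reference, can be illustrated with a toy encoder. This is a hedged sketch, not the paper's algorithm: the names `ref_compress`/`ref_decompress` and the naive greedy longest-match search are assumptions for this example.

```python
# Toy illustration of referential compression (not the paper's method):
# encode the input as (position, length) matches against a reference,
# falling back to literal characters when no long-enough match exists.
# The greedy O(len(ref) * len(seq)) search is a deliberate simplification.

def ref_compress(ref, seq, min_match=4):
    out, i = [], 0
    while i < len(seq):
        best_pos, best_len = -1, 0
        for j in range(len(ref)):  # longest match of seq[i:] within ref
            k = 0
            while i + k < len(seq) and j + k < len(ref) and seq[i + k] == ref[j + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len >= min_match:
            out.append(("match", best_pos, best_len))
            i += best_len
        else:
            out.append(("lit", seq[i]))
            i += 1
    return out

def ref_decompress(ref, ops):
    parts = []
    for op in ops:
        if op[0] == "match":
            _, pos, n = op
            parts.append(ref[pos:pos + n])
        else:
            parts.append(op[1])
    return "".join(parts)

ref, seq = "ACGTACGTACGT", "ACGTTTACGT"
ops = ref_compress(ref, seq)
print(ops)
print(ref_decompress(ref, ops) == seq)  # -> True
```

For two human genomes, which differ in far less than 1% of their positions, almost the entire input collapses into a handful of long match operations, which is where compression ratios like 400:1 come from.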
Analysis of Affymetrix Exon Arrays
Exon arrays enable the monitoring of expression at a more fine-grained level than conventional 3’ arrays. By targeting single exons, alternative splicing events can be detected. However, the increased amount of data resulting from the denser coverage of the transcribed regions gives rise to new challenges in data analysis compared to 3’ arrays. One must carefully decide which probes are considered for the final analysis to avoid measurements that do not reflect biological reality. The most outstanding difference between gene-level and exon-level analysis emerges in the detection of differential expression. To decide whether an exon is differentially expressed between two conditions, it must be set in relation to its corresponding gene. Therefore, completely new algorithms need to be applied. This work gives an overview of the analysis of Affymetrix exon arrays. Technical design, preprocessing, and the detection of alternative splicing are discussed, and finally a complete workflow is proposed.
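The gene-relative comparison described above can be illustrated with a simple splicing-index-style computation: the exon signal is normalized by its gene's signal before the two conditions are compared. The function name, inputs, and log2-ratio form here are illustrative assumptions, not taken from this abstract.

```python
# Illustrative sketch: judge an exon's change relative to its gene by
# comparing the normalized exon/gene signal ratio across two conditions.
import math

def splicing_index(exon_a, gene_a, exon_b, gene_b):
    """log2 exon/gene ratio in condition A minus the same in condition B."""
    return math.log2(exon_a / gene_a) - math.log2(exon_b / gene_b)

# Exon follows its gene (both double): no evidence of a splicing change
print(splicing_index(100, 1000, 200, 2000))  # -> 0.0
# Exon stays flat while the gene doubles: candidate alternative splicing event
print(splicing_index(100, 1000, 100, 2000))  # -> approximately 1.0
```

The second case shows why the normalization matters: judged in isolation, the exon looks unchanged, yet relative to its doubling gene it is a splicing candidate.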
Finding k-Dissimilar Paths with Minimum Collective Length
Shortest path computation is a fundamental problem in road networks. However,
in many real-world scenarios, determining solely the shortest path is not
enough. In this paper, we study the problem of finding k-Dissimilar Paths with
Minimum Collective Length (kDPwML), which aims at computing a set of paths from
a source s to a target t such that all paths are pairwise dissimilar by at
least \theta and the sum of the path lengths is minimal. We introduce an exact
algorithm for the kDPwML problem, which iterates over all possible s-t paths
while employing two pruning techniques to reduce the prohibitively expensive
computational cost. To achieve scalability, we also define the much smaller set
of the simple single-via paths, and we adapt two algorithms for kDPwML queries
to iterate over this set. Our experimental analysis on real road networks shows
that iterating over all paths is impractical, while iterating over the set of
simple single-via paths can lead to scalable solutions with only a small
trade-off in the quality of the results. Comment: Extended version of the SIGSPATIAL'18 paper under the same title.
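The kDPwML objective, k paths that are pairwise dissimilar by at least \theta with minimal total length, can be sketched with a brute-force search over candidate paths. This is a toy illustration only: the edge-overlap dissimilarity measure below is an assumption for this example (the paper's measure may differ), and real instances need the pruning and single-via techniques the abstract describes.

```python
# Toy sketch of the kDPwML objective (not the paper's algorithm): among
# candidate s-t paths, choose k that are pairwise dissimilar by at least
# theta while minimizing the sum of path lengths.
from itertools import combinations

def edges(path):
    """Undirected edge set of a path given as a vertex sequence."""
    return {frozenset(e) for e in zip(path, path[1:])}

def dissimilarity(p, q):
    """1 minus the fraction of shared edges, relative to the shorter path."""
    ep, eq = edges(p), edges(q)
    return 1.0 - len(ep & eq) / min(len(ep), len(eq))

def k_dissimilar(paths, lengths, k, theta):
    """Brute force: return the best (total_length, index_tuple), or None."""
    best = None
    for combo in combinations(range(len(paths)), k):
        ok = all(dissimilarity(paths[i], paths[j]) >= theta
                 for i, j in combinations(combo, 2))
        if ok:
            total = sum(lengths[i] for i in combo)
            if best is None or total < best[0]:
                best = (total, combo)
    return best

# Three candidate paths from s=0 to t=3 with their lengths:
paths = [[0, 1, 3], [0, 2, 3], [0, 1, 2, 3]]
lengths = [5, 6, 7]
print(k_dissimilar(paths, lengths, k=2, theta=0.9))  # -> (11, (0, 1))
```

Even this tiny example hints at the scalability problem the paper tackles: the candidate set of all s-t paths is exponential in real road networks, which is why the authors restrict iteration to the much smaller set of simple single-via paths.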